Skip to content

fix: support name look-up on InstanceExists() and InstanceShutdown() if no providerID set#245

Merged
bwagner5 merged 5 commits into
oxidecomputer:mainfrom
bwagner5:fix-no-prov-id
May 30, 2026
Merged

fix: support name look-up on InstanceExists() and InstanceShutdown() if no providerID set#245
bwagner5 merged 5 commits into
oxidecomputer:mainfrom
bwagner5:fix-no-prov-id

Conversation

@bwagner5
Copy link
Copy Markdown
Contributor

Changes

  • Supports a name look-up if providerID is not set rather than erroring in InstanceExists() and InstanceShutdown().
  • Adds tests for InstancesV2.
  • Removes unused K8s client in InstancesV2

Why?

There is a comment on the interface of InstanceV2 saying providerID and name should be supported.

Apart from the interface comment, the edge case where a name look-up helps:

  1. A node comes online while CCM is unhealthy due to an update or some other issue.
  2. The node shuts down before CCM becomes healthy.
  3. CCM comes back up, sees the node, but InstanceMetadata() andInstanceExists() return errors since there is no providerID set.
  4. The Node Lifecycle Controller does not automatically delete the node, so it stays there forever (or until a user manually deletes the node).

Ideally, we should clean-up the node in this case since it in-fact does not exist.

NOTE: The name look-up only works if the node is registered with the Name of the Oxide instance. This should be pretty standard though. It's more standard to use the hostname as the node name of a K8s node, which in most cases will be the Oxide instance name. We could match based on hostname, but the query would be much more expensive since we'd have to resort to a List rather than a direct Fetch (View).

Testing

  1. Turn off CCM (scale to 0)
  2. Start Oxide Instance and join to K8s cluster
  3. Stop and Delete Oxide Instance
  4. Scale up CCM to 1 replica

v0.6.0 the node will continue to exist forever and CCM logs show:

node_lifecycle_controller.go:156] error checking if node oxide-w2 exists: failed retrieving instance id from provider id: provider id is empty
  1. Deploy this CCM patch. The Node resource is deleted since InstanceExists returns false (from name lookup).

@bwagner5 bwagner5 requested a review from a team as a code owner May 29, 2026 19:04
Copy link
Copy Markdown
Collaborator

@sudomateo sudomateo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor suggestions. Overall this looks great. I was a bit on the fence about returning true, nil in InstanceShutdown when the instance did not exist but I think that's the correct move to let the CCM stop worrying about the node.

Comment thread internal/provider/instances_v2_test.go
if errors.Is(err, oxide.ErrObjectNotFound) {
return true, nil
}
return false, err
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either here or in getInstance we should decorate the error with the instance ID we tried to fetch. I don't remember if the outer Kubernetes calling code has that information already.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getInstance returns what we try to look up in the wrapped error. If it can't parse the provider ID, it's printed. If it uses the fallback path when the providerID is not set, it wraps the error returned from the SDK. The wrapped SDK error is admittedly ugly, so will need to figure out how to format that better into a structured line...

│ oxide-ccm-oxide-cloud-controller-manager-776766d446-2zmhd E0529 21:35:34.515314       1 node_controller.go:288] Error getting instance metadata for node addresses: failed viewing oxide instance: GET https://oxide<blah>/v1/instances/46221934-d503-43ea-825f-730ada60d85c                             │
│ oxide-ccm-oxide-cloud-controller-manager-776766d446-2zmhd ----------- RESPONSE -----------                                                                                                                                                                                                                                  │
│ oxide-ccm-oxide-cloud-controller-manager-776766d446-2zmhd Status: 404 ObjectNotFound                                                                                                                                                                                                                                        │
│ oxide-ccm-oxide-cloud-controller-manager-776766d446-2zmhd Message: not found: instance with id "46221934-d503-43ea-825f-730ada60d85c"                                                                                                                                                                                       │
│ oxide-ccm-oxide-cloud-controller-manager-776766d446-2zmhd RequestID: 9163f53f-95b7-4c6b-8d61-538df4c4c408                                                                                                                                                                                                                   │
│ oxide-ccm-oxide-cloud-controller-manager-776766d446-2zmhd ------- RESPONSE HEADERS -------                                                                                                                                                                                                                                  │
│ oxide-ccm-oxide-cloud-controller-manager-776766d446-2zmhd Content-Type: [application/json]                                                                                                                                                                                                                                  │
│ oxide-ccm-oxide-cloud-controller-manager-776766d446-2zmhd X-Request-Id: [9163f53f-95b7-4c6b-8d61-538df4c4c408]                                                                                                                                                                                                              │
│ oxide-ccm-oxide-cloud-controller-manager-776766d446-2zmhd Date: [Fri, 29 May 2026 21:35:34 GMT]                                                                                                                                                                                                                             │
│ oxide-ccm-oxide-cloud-controller-manager-776766d446-2zmhd Content-Length: [177]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's coming from oxide.go here: https://github.com/oxidecomputer/oxide.go/blob/93d8204aae4d37f5c56c90ea94764f5d4b0cce3a/oxide/errors.go#L96-L133

We have RFD-614 to discuss how we'd like to improve custom errors but we've stalled a bit on finishing it since we've implemented some of the stuff discussed there already. Ideally we'd pass the error to some structured logging package and it calls some structured printing method instead of Error() string. Either way that's another discussion for oxide.go.

@bwagner5 bwagner5 requested a review from sudomateo May 29, 2026 21:41
Copy link
Copy Markdown
Collaborator

@sudomateo sudomateo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@bwagner5 bwagner5 merged commit 64672c9 into oxidecomputer:main May 30, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants